Tennis is one of the most popular racquet sports, played worldwide by countless individuals, from recreational players to elite athletes. Although the sport is loved and played by many, the professional men's tennis scene has been dominated by three top players for the past decade (Wertheim, 2019). Often referred to as 'the Big Three', these athletes consistently win the largest tournaments around the world, making it extremely difficult for other players to make a living off the sport (Wertheim, 2019).
With this in mind, our project focuses on one of the most controversial factors in men's tennis: players’ prize money. The wealth disparity in professional tennis is more drastic than in other major sports, "with the best players traveling with entourages aboard private jets, and a good chunk of the field trying to break through without going broke" (Gay, 2019). To further examine this anomaly, we will study how prize money is affected by factors such as a player’s age, current ranking, and seasons played.
With data collected by “Ultimate Tennis Statistics” in 2019, the Player Stats dataset (player_stats.csv) provides information on top-ranking tennis players from around the world, with columns including age, ranking, nationality, and many more. To examine the relationship between prize money and various factors, we selected prize money, age, current ranking, and seasons played as our main variables.
install.packages("plotly")
library(tidyverse)
library(GGally)
library(caret)
library(plotly)
# Load in the original data set
url <- "https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS"
download.file(url, destfile = "player_stats.csv")
player_stats <- read_csv("player_stats.csv")
glimpse(player_stats)
head(player_stats)
After loading the Tennis Statistics data set, we see a significant amount of available information that needs to be cleaned and wrangled before proper analysis.
The first step involved cleaning up the column names with make.names() for easier identification and removing unnecessary variable columns. We cut out the columns that were missing most of their values or were irrelevant to our final analysis. This left us with the columns Age, Seasons, Current.Rank, Best.Rank, and Prize.Money.
The second step involved cleaning up the values within the columns themselves. We wanted these columns to contain only numerical values, so we removed unnecessary spaces and extra values. We also used the as.numeric() function to convert each column to a numeric type.
The final column, Prize.Money, was the most challenging one to clean. We noticed many inconsistencies in formatting such as the occasional inclusion of either $ or US$ symbols as well as unnecessary commas and words. To remove everything other than the numerical amounts, we used the mutate() and gsub() functions to cut out specific characters and then used separate() to split the values with delimiters, removing the extra column with the irrelevant strings.
Finally, we added a Prize.Money.in.million column for easy readability in the distribution graph before creating the final data frame for analysis and processing.
# Cleaning Un-needed Columns
colnames(player_stats) <- make.names(colnames(player_stats))
player_stats <- select(player_stats,
-c(X1, Name, Current.Elo.Rank:Tour.Finals, Plays, Wikipedia, Backhand, Favorite.Surface,Active, Height, Turned.Pro))
# Cleaning Prize.Money Column
player_stats <- player_stats %>%
mutate(Prize.Money = gsub("\\$", "", Prize.Money)) %>%
mutate(Prize.Money = gsub("US", "", Prize.Money)) %>%
mutate(Prize.Money = gsub("\\,", "", Prize.Money)) %>%
mutate(Prize.Money = gsub("^\\s+|\\s+$", "", Prize.Money)) %>%
separate(col = Prize.Money , into = c("Prize.Money", "extra"), sep = " ") %>%
select(-extra) %>%
mutate(Prize.Money = as.numeric(Prize.Money)) %>%
mutate(Prize.Money.in.million = Prize.Money /1000000)
# Cleaning Age Column
player_stats <- player_stats %>%
separate(col = Age , into = c("Age", "extra"), sep = " ") %>%
select(-extra) %>%
mutate(Age = as.numeric(Age))
# Cleaning Current.Rank Column
player_stats <- player_stats %>%
separate(col = Current.Rank , into = c("Current.Rank", "extra"), sep = " ") %>%
select(-extra) %>%
mutate(Current.Rank = as.numeric(Current.Rank))
# Cleaning Best.Rank Column
player_stats <- player_stats %>%
separate(col = Best.Rank , into = c("Best.Rank", "extra"), sep = " ") %>%
select(-extra) %>%
mutate(Best.Rank = as.numeric(Best.Rank))
# Remove NA rows and keep only the columns needed for analysis
player_stats_selected <- player_stats %>%
select(Age, Seasons, Current.Rank, Best.Rank, Prize.Money, Prize.Money.in.million) %>%
na.omit()
head(player_stats_selected)
# Summary statistics
player_stats_summary <- summarize(player_stats_selected,
min_Prize = min(Prize.Money),
max_Prize = max(Prize.Money),
mean_Prize = mean(Prize.Money),
median_Prize = median(Prize.Money),
IQR = IQR(Prize.Money),
sd = sd(Prize.Money))
player_stats_summary
The summary statistics above show that the values of Prize.Money are very widely distributed (spread out from $145 to $139,144,944) with an interquartile range of $2,822,774 and a standard deviation of $12,795,834.
In order to determine how prize money is affected by various predictors, we need to look deeper into its distribution; to do this, we visualized the distribution of Prize.Money.in.million using a density plot and a histogram.
options(repr.plot.width = 6, repr.plot.height = 4)
# Density plot for prize value distribution
prize_distribution_plot <- ggplot(player_stats_selected, aes(x = Prize.Money.in.million)) +
geom_density(fill = "lightblue",
color = 'steelblue') +
labs( x = "Prize Money (in millions)", y = "Distribution density") +
ggtitle("Figure 1.3.1 Prize Money Distribution(Density)")
prize_distribution_plot
# Histogram for prize value distribution
prize_hist <- player_stats_selected %>%
ggplot(aes(x = Prize.Money.in.million)) +
geom_histogram(bins = 50,
position = "identity",
fill = "lightblue",
color = 'steelblue') +
labs(x = "Prize Money (in millions)", y = "Count of players") +
ggtitle("Figure 1.3.2 Prize Money Distribution")
prize_hist
As shown in the plots above, the distribution of the prize money is extremely skewed due to extreme outliers, further emphasizing the drastic wealth disparity between the top few players and everyone else. Since the prize money is the centre of focus for our project, this extremely unbalanced distribution is a major obstacle. Therefore, it is important to keep this in mind as we carry out the rest of our analysis, given that these outliers could influence our results.
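Because the prize money values span several orders of magnitude, a raw-scale histogram compresses most players into the lowest bins. As a supplementary check (not one of the numbered figures above), the same histogram can be drawn on a log10 x-axis to spread the values out and make the long right tail easier to inspect:
# Supplementary sketch: the prize money histogram on a log10 x-axis,
# which makes the heavily right-skewed tail easier to see.
prize_hist_log <- player_stats_selected %>%
    ggplot(aes(x = Prize.Money)) +
    geom_histogram(bins = 50, fill = "lightblue", color = "steelblue") +
    scale_x_log10() +
    labs(x = "Prize Money (log10 scale)", y = "Count of players")
prize_hist_log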
At this stage, we are considering a multivariate prediction model with four predictors while also deciding between classification and regression. To narrow down our options, we plotted all the potential predictor variables in a single ggpairs plot to assess the linear relationships between Prize.Money and each of the four predictors: Age, Seasons, Current.Rank, and Best.Rank.
options(repr.plot.width = 9, repr.plot.height = 6)
ggpair_plot <- player_stats_selected %>%
select(-Prize.Money) %>%
ggpairs() +
ggtitle("Figure 1.4 Correlation with All Predictors")
ggpair_plot
After analyzing Figure 1.4, we noticed the four selected predictors have much lower correlations with Prize.Money than we expected. In response, we made the following decisions for building the prediction model:
1. Remove the Best.Rank predictor: Seasons has only a moderate correlation with Prize.Money (0.447), while Age and Current.Rank have weak correlations of 0.31 and -0.307 respectively. Best.Rank has the weakest correlation of all (-0.299), so we decided to exclude it from the predictors.
2. Use classification instead of regression: the distribution of Prize.Money is extremely skewed and its linear correlations with the predictors are weak, so a linear regression model would be inaccurate for making predictions.
Given these low correlation values, we also decided against k-nearest neighbour (k-nn) regression, as classification gives a more intuitive prediction (a range of values) rather than the exact quantitative value of k-nn regression; predicting a particular number is less reliable than predicting a general range in which that value could lie. Therefore, we decided to categorize the numeric values in Prize.Money into separate tiers and build a k-nn classification model.
We decided to divide the prize money into brackets using the quantile() function.
Initially, we tried dividing the prize values into five brackets, cutting at approximately the 1/6, 2/6, 4/6, and 5/6 quantiles (16.5%, 33%, 67%, and 83.5%). These cut points give us the following percentiles:
# Retrieving prize money information for the quantile function
prize_money_percentiles = player_stats_selected %>%
select(Prize.Money) %>%
na.omit() %>%
pull()
# Making a percentile with respect to 16.5% 33% 67% and 83.5%
quantile(prize_money_percentiles, prob=c(.165,.33,.67,.835))
# Creates categorization using these percentiles
player_stats_bad_classification <- player_stats_selected %>%
mutate(prize.money.classified = ifelse(Prize.Money < 117744.6, "Really Low Amount (Less than $117,744.60)",
ifelse(Prize.Money <302981.7, "Low Amount, (Less than $302,981.70)",
ifelse(Prize.Money < 1591534.75, "Medium Amount",
ifelse(Prize.Money < 5258947.875, "High Amount, (More than $1,591,534.75)",
"Very High Amount (More than $5,258,947.88)")))))
# Refactors categorization in correct order
player_stats_bad_classification$prize.money.classified = factor(player_stats_bad_classification$prize.money.classified, levels =
c("Really Low Amount (Less than $117,744.60)",
"Low Amount, (Less than $302,981.70)",
"Medium Amount",
"High Amount, (More than $1,591,534.75)",
"Very High Amount (More than $5,258,947.88)"))
# Plot unfavourable percentiles for comparison
plot_bad = player_stats_bad_classification %>%
ggplot(aes(x = prize.money.classified)) +
geom_bar() +
ggtitle("Figure 2.1.1 Unfavourable categorization for comparison") +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
labs(x = "Prize Money Classes", y = "Number of players")
# Preview the data frame
head(player_stats_bad_classification)
However, Figure 2.1.1 below shows that our initial choice of cut points creates an inefficient distribution of prize money classes. Based on what we learned from the original distribution and what we observed here, we decided to refine the percentiles in more detail.
options(repr.plot.width = 6, repr.plot.height = 5)
plot_bad
After dividing the data the previous way, we saw that the "low" and "high" earning brackets covered a wide range of earnings relative to the range of the "medium" bracket. In particular, the "high" earning bracket (67% - 100%) had an abnormally wide range of earning amounts, as it included the extreme outliers as well as moderate values. Thus, we decided to split the data into more detailed classification brackets:
# Making a percentile with respect to 10% 33% 67% and 90%
quantile(prize_money_percentiles, prob=c(.1,.33,.67,.9))
# Re-categorize Prize.Money
player_stats_classes <- player_stats_selected %>%
mutate(prize.money.classified = ifelse(Prize.Money < 65411, "Really Low Amount (Less than $65411.00)",
ifelse(Prize.Money <302981.7, "Low Amount, (Less than $302981.70)",
ifelse(Prize.Money < 1591534.75, "Medium Amount",
ifelse(Prize.Money < 8548203.5, "High Amount, (More than $1,591,534.75)",
"Very High Amount (More than $8,548,203.50)"))))) %>%
mutate(prize.money.classified = as.factor(prize.money.classified))
# Reorder the factor levels of the prize money classes
player_stats_classes$prize.money.classified = factor(player_stats_classes$prize.money.classified, levels =
c("Really Low Amount (Less than $65411.00)",
"Low Amount, (Less than $302981.70)",
"Medium Amount",
"High Amount, (More than $1,591,534.75)",
"Very High Amount (More than $8,548,203.50)"))
# Plot final categorization
prize_class_barplot <- player_stats_classes %>%
ggplot(aes(x = prize.money.classified)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 70, hjust = 1))+
labs(x = "Prize Money Classes", y = "Number of players")+
ggtitle("Figure 2.1.2 Final categorization of prize money")
# Preview the data frame
head(player_stats_classes)
Figure 2.1.2 shows the distribution of the classes using our new percentiles. Now one can see that we achieve a much more balanced, roughly bell-shaped distribution across the classes.
options(repr.plot.width = 6, repr.plot.height = 5)
prize_class_barplot
For classification, we need to rebalance the data, as some classes, such as the 0%-10% and 90%-100% brackets, contain considerably fewer observations than the others. Thus, we oversampled the data with upSample() to ensure equal voting power for all classes.
# Oversample the smaller classes so that every class has the same number of observations
player_stats_oversampled <- upSample(x = select(player_stats_classes, Seasons, Current.Rank, Age, Prize.Money, prize.money.classified),
                                     y = select(player_stats_classes, prize.money.classified) %>% unlist())
glimpse(player_stats_oversampled)
options(repr.plot.width = 9, repr.plot.height = 5)
prize_oversample_barplot <- player_stats_oversampled %>%
ggplot(aes(x = prize.money.classified)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 55, hjust = 1))+
labs(x = "Prize Money Classes", y = "Number of players")+
ggtitle("Figure 2.2 Distribution after oversampling")
prize_oversample_barplot
After oversampling, the number of observations in each class is equal, as shown in Figure 2.2 above.
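As a quick numeric check to complement Figure 2.2 (this count was not part of the original output), the class sizes can also be tabulated directly:
# Supplementary check: count the observations in each class after oversampling
player_stats_oversampled %>%
    count(prize.money.classified)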
# Creating Training and Test Sets
set.seed(2020)
training_rows <- player_stats_oversampled %>%
select(prize.money.classified) %>%
unlist() %>%
createDataPartition(p=0.75,list=FALSE)
X_train <- player_stats_oversampled %>%
select(Age,Seasons,Current.Rank) %>%
slice(training_rows) %>%
data.frame()
Y_train <- player_stats_oversampled %>%
select(prize.money.classified) %>%
slice(training_rows) %>%
pull()
X_test<- player_stats_oversampled %>%
select(Age,Seasons,Current.Rank) %>%
slice(-training_rows) %>%
data.frame()
Y_test <- player_stats_oversampled %>%
select(prize.money.classified) %>%
slice(-training_rows) %>%
pull()
In this section, we standardize and scale the predictor columns used by the classifier. The values in Current.Rank and Age have considerably different ranges: rankings run from one into the several hundreds, while ages fall roughly between 20 and 40 years. To ensure that all predictors have an equal influence on the predictions, we create a scale_transformer with preProcess() to standardize the data.
# Scaling Training and Test Set
set.seed(2020)
scale_transformer <- preProcess(X_train, method = c("center", "scale"))
X_train_scaled <- predict(scale_transformer, X_train)
X_test_scaled <- predict(scale_transformer, X_test)
head(X_train_scaled)
head(X_test_scaled)
Next, we perform 10-fold cross-validation to select the best k value for our classifier. Cross-validation estimates each candidate k's accuracy by splitting the training data into multiple equal folds and averaging the accuracy across them; the k with the best average accuracy is then chosen.
# input multiple k values, the model will evaluate the best k value to use
set.seed(2020)
ks = data.frame(k = seq(from = 1, to = 51, by = 2))
train_control <- trainControl(method='cv',number = 10)
knn_model_cv_10 <- train(x = X_train_scaled,
                         y = Y_train,
                         method = 'knn',
                         tuneGrid = ks,
                         trControl = train_control)
# plotting k against model accuracy
set.seed(2020)
accuracies <- knn_model_cv_10$results
accuracy_vs_k <- ggplot(accuracies,aes(x=k,y=Accuracy))+
geom_point()+
geom_line()+
labs(x='k value',y='Model Accuracy') +
ggtitle("Figure 2.3.3 Model accuracy v.s K values")
accuracy_vs_k
set.seed(2020)
best_k <- knn_model_cv_10$results %>%
filter(Accuracy == max(Accuracy)) %>%
select(k) %>%
pull()
best_accuracy <- knn_model_cv_10$results %>%
filter(Accuracy == max(Accuracy)) %>%
select(Accuracy) %>%
pull()
best_k
best_accuracy
By graphing the accuracy results of our 10-fold cross-validation, we confirmed that accuracy is highest at k = 1 and drops immediately afterwards. Therefore, we chose k = 1 as the optimal value, concluding that the roughly 80% accuracy gives us a sufficiently precise model.
# Re-train the final model with the selected K
set.seed(2020)
k = data.frame(k=best_k)
knn_model_best <- train(x = X_train_scaled,
y = Y_train,
method = 'knn',
tuneGrid = k )
print(knn_model_best)
# Predict on test set using retrained model
set.seed(2020)
Y_predicted <- predict(object = knn_model_best, X_test_scaled)
# Comparing predicted classification with actual classification of the test data
model_quality <- confusionMatrix(data = Y_predicted, reference = Y_test)
model_quality
model_quality$overall[1]
With the retrained final classifier, our model predicted the test data set with a prediction accuracy of over 80%.
This prediction accuracy validated our decision to keep three predictors instead of reducing to two. We tried the model with different pairs of predictors but the prediction accuracy with all three variables was notably higher than the others. Intuitively, using all three variables makes the most sense because the amount of prize money a player makes is not solely dependent on one or two variables; there are many factors that contribute to how much money someone makes, therefore predictions should be made accordingly.
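The subset comparison itself is not shown in the cells above, so the following is only a minimal sketch of how it could be reproduced, reusing X_train_scaled, X_test_scaled, Y_train, Y_test, and best_k from the earlier cells; the particular pairs listed are assumptions about which combinations were tried.
# Supplementary sketch: compare test accuracy of k-nn models trained on
# different predictor subsets (the subsets listed here are illustrative).
set.seed(2020)
predictor_sets <- list(c("Age", "Seasons"),
                       c("Age", "Current.Rank"),
                       c("Seasons", "Current.Rank"),
                       c("Age", "Seasons", "Current.Rank"))
subset_accuracies <- sapply(predictor_sets, function(vars) {
    model <- train(x = X_train_scaled[, vars, drop = FALSE],
                   y = Y_train,
                   method = "knn",
                   tuneGrid = data.frame(k = best_k))
    predictions <- predict(model, X_test_scaled[, vars, drop = FALSE])
    mean(predictions == Y_test)
})
names(subset_accuracies) <- sapply(predictor_sets, paste, collapse = " + ")
subset_accuracies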
Based on the confusion matrix results, our model perfectly predicted the Really Low Amount class. In contrast, the model made a few prediction errors in the other classes (Low Amount, Medium Amount, High Amount, and Very High Amount).
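To make this class-by-class comparison easier to read (a supplementary step, not part of the original output), the per-class sensitivity, i.e. the proportion of each true class that the model predicted correctly, can be pulled directly from the confusion matrix object:
# Supplementary: per-class sensitivity (recall) from the confusion matrix
model_quality$byClass[, "Sensitivity"]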
options(repr.plot.width = 8, repr.plot.height = 5)
training_oversampled_scaled <- bind_cols(X_train_scaled, data.frame(price_class = Y_train))
prize_class_seasons_vs_age <- plot_ly(training_oversampled_scaled, x = ~Seasons, y = ~Age, z = ~Current.Rank, color = ~price_class, size = 5, opacity = 0.8)
prize_class_seasons_vs_age <- prize_class_seasons_vs_age %>% add_markers()
prize_class_seasons_vs_age <- prize_class_seasons_vs_age %>%
layout(scene = list(xaxis = list(title = 'Seasons Played'),
yaxis = list(title = 'Age'),
zaxis = list(title = 'Current Rank')))
prize_class_seasons_vs_age
In order to visualize the classification analysis, we created a 3-dimensional scatterplot using x = Seasons, y = Age, and z = Current.Rank with the colour of each observation being its prize money class. The plot shows that the players who have won the most prize money are generally those who are older in age, have played more seasons, and have a better current ranking.
In conclusion, we found that professional tennis players that are older, more experienced, and have a better current ranking are generally the ones who have made the most money throughout their careers. That being said, the correlation between each of these three variables and prize money is not that strong. This suggests that each variable on its own is not a good predictor of a tennis player’s prize money; however, plotting these variables all together on a 3-dimensional scatterplot shows a general trend regarding which players have made the most money. Therefore, this model could be used for k-nn classification in order to predict how much money a male tennis player could earn based on their age, current ranking, and the number of seasons played.
Before conducting this study, we expected that players who were older, higher ranked, and more experienced would have won more prize money throughout their careers because it intuitively makes sense. Naturally, athletes who have played professionally for a long time and are ranked higher within the league are the ones who have won more tournaments, thus also having won more prize money.
Additionally, we initially expected each individual predictor to have a stronger linear correlation with prize money. However, the exploratory analysis shows that, at best, they have a moderate correlation. This may be due to the abnormal outliers that made our prize money distribution extremely skewed. As mentioned in the introduction, the three top-ranked tennis players have dominated the tennis league for the past decade and won almost all major tournaments. These abnormalities affected the statistics of the dataset and the linear correlations between each variable and prize money, therefore influencing the accuracy of our model as well.
These findings can be significant in helping people plan their lives accordingly. Using this model, male tennis players could gain insight into how much money they might make on this career path. Similarly, parents of aspiring players could get an estimate of which factors to consider if their child wants to be successful in this field, giving them more insight into how they can help their child plan their future.
Additionally, questions involving income often raise concerns regarding equality as well. With that in mind, our model can provide more insight into whether any form of inequality/discrimination is a problem in this industry. For example, suppose our model predicts that a player should be making a lot more than they actually are. In that case, further investigation may be encouraged to determine whether this discrepancy is caused by variables such as nationality, race, etc. In order to improve our world, we need to address equality issues and consider them while analyzing statistics.
There are limitations to our findings that should be noted and improved on in future experiments. In particular, 3D data cannot be analyzed as accurately as 2D data due to the distorting effect of perspective; the data’s readability therefore decreased when we plotted it using the 3D scatterplot. In future studies, it may be better to use different visualization techniques, which may be outside the scope of the DSCI 100 course material. This data set and its analysis also open the field to various new questions for future studies, such as: “How would removing extreme outliers affect the analysis and results of this prediction model?”, “What is the golden age for children to start playing tennis if they want to become professional tennis players?”, and “Which countries have won the most awards and/or have a better program for developing tennis skills in children?” These questions should be analyzed separately; however, the same tidy data and overall methods can be used. For the next experiment, it would be best to minimize the limitations of this study; no data or experiment can be perfect, but we should never stop enhancing and improving.
Gay, J. (2019, August 27). Telling the truth about tennis: Noah Rubin's Instagram series 'Behind the Racquet' chronicles the humble real world outside the top 10. Wall Street Journal. Retrieved from http://ezproxy.library.ubc.ca/login?url=https://search-proquest-com.ezproxy.library.ubc.ca/docview/2280317831?accountid=14656
Timbers, T., Campbell, T., & Lee, M. (2020). Introduction to Data Science [eBook]. https://ubc-dsci.github.io/introduction-to-datascience/
Wertheim, J. (2019). Holding Court. Sports Illustrated, 130(16), 22.